Simple Multiprocessor Performance Measurement Techniques
and Examples of Their Use


Alan  Mink 
John  W.  Roberts 
Jesse  M.  Draper 
Robert  J.  Carpenter 


This  report  describes  simple  hardware  techniques  for  the  measurement  of  the  per- 
formance of  multiprocessor  computers.  A number  of  examples  of  data  obtained  using 
these  techniques  are  reported,  as  well  as  a discussion  of  the  timing  accuracy  obtain- 
able with  this  approach. 

Key  words:  Multiprocessor  computers;  parallel  computers;  performance  measurement; 
hardware. 


INTRODUCTION 


A full expression of the performance of multiprocessor parallel computers will require the
measurement of many parameters. These include the utilization of such resources as the individual
processors, interconnection network, cache memory, etc. Special hardware to gather this information without
perturbation  of  the  computer  under  test  is  under  development  at  the  National  Bureau  of  Standards.  Our 
long-range  goal  is  to  develop  a measurement  device  which  will  be  connected  at  many  (100-200)  points 
to  the  system  under  test.  The  signal  lines  will  be  divided  into  groups,  and  each  group  fed  into  a 
"preprocessor",  which  performs  extensive  reduction  and  interpretation  of  the  incoming  data.  The  prepro- 
cessors generally  run  independently  from  one  another,  but  when  any  one  activates  a trigger  mechanism, 
they  all  cooperate  to  place  a record  of  the  system  state  into  a large  memory  from  which  further  analysis 
can  be  performed.  Such  a measurement  device  can  be  made  sufficiently  general  that  it  may  be  used 
with  a substantial  range  of  multiprocessor  architectures. 

Since  it  will  take  many  months  to  design  and  build  the  long-term  measurement  device,  we  decided  to 
begin  by  setting  up  the  Interim  Measurement  System,  in  order  to  begin  taking  measurements  earlier. 
We  will  be  able  to  gain  first-hand  experience  in  the  problems  of  measuring  multiprocessors,  and  can  use 
the  knowledge  gained  to  guide  the  development  of  the  long-term  system.  Furthermore,  there  is  a large 
amount  of  needed  design  information  common  to  the  two  measurement  systems,  and  we  will  be  able  to 
establish  the  suitable  solutions  on  the  interim  system  before  it  will  be  possible  to  do  so  on  the  long- 
term system.  This  will  help  to  catch  major  design  weaknesses  early,  so  they  can  be  avoided  in  the  later 
system. 


This National Bureau of Standards report is not subject to copyright in the United States. Certain
commercial equipment, instruments, or materials are identified in this paper in order to adequately specify
the experimental procedure. Such identification does not imply recommendation or endorsement by
the National Bureau of Standards, nor does it imply that the materials or equipment identified are
necessarily the best available for the purpose.
LIMITATIONS OF TRADITIONAL APPROACHES


The  traditional  software  approach  to  instrument  a program  for  measuring  performance  is  to  insert 
supervisory  (system  interrupt)  calls  (to  the  operating  system)  to  obtain  a timestamp  at  the  beginning  and 
end  of  an  event,  to  store  this  information  in  a buffer  area  in  memory,  and  later  write  it  to  a file  at  some 
convenient time. A major problem with this approach is that it perturbs the original program with extra
overhead, particularly additional execution time. This additional execution time is required for the
operating system to provide the time and for the code added to the program being tested to store the
time with the data describing the event. This time overhead is on the order of tens of milliseconds, or
worse. In a multiprocessor system, this overhead is suffered only by the processor asking for the timing;
all others proceed without delay. This can greatly distort execution in interprocessor synchronization
situations. Of course this distortion would not exist in a uniprocessor situation.
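
The structure of the traditional approach can be sketched in C as follows. This is only an illustration of the pattern described above (ask the operating system for the time, then store it with the event description in a memory buffer); the names trace_event, trace_buf, and TRACE_CAPACITY are hypothetical, and the POSIX call clock_gettime is a modern stand-in for whatever supervisory call the system under test provides. Its cost on a current machine says nothing about the overhead figures quoted above.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical in-memory trace buffer; the size is illustrative only. */
#define TRACE_CAPACITY 1024

struct trace_record {
    uint32_t event_id;      /* which event was reached */
    struct timespec when;   /* timestamp supplied by the operating system */
};

static struct trace_record trace_buf[TRACE_CAPACITY];
static size_t trace_len = 0;

/* Record one event: obtain the time from the operating system, then
 * store it together with the event identifier.
 * Returns 0 on success, -1 if the buffer is full or the clock call fails. */
int trace_event(uint32_t event_id)
{
    if (trace_len >= TRACE_CAPACITY)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &trace_buf[trace_len].when) != 0)
        return -1;
    trace_buf[trace_len].event_id = event_id;
    trace_len++;
    return 0;
}
```

Every call to trace_event delays only the processor that makes it, which is the source of the distortion in synchronization situations noted above.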


INTERIM  MEASUREMENT  SYSTEM 


The  Interim  Measurement  System,  unlike  the  traditional  software  approach,  minimally  perturbs  the 
program  by  requiring  very  little  overhead  to  capture  a timestamp.  The  Interim  Measurement  System 
code imbedded in the user's software of the system under test is very simple: as little as a single
assignment statement per timestamp. The overall configuration of the measurement system is illustrated
in  Figure  1.  At  selected  test  points  in  the  programs,  the  processors  write  events  to  a board  called  the 
Event Data Card, which has been installed in the normal I/O space of the system. These writes are the edc
assignment statements. The testbed computer being measured has a Multibus card cage for peripheral
devices, so the Event Data Card was implemented on a Multibus board and designed to respond as a
Multibus slave device. Only a few microseconds are required for the computer being measured to execute
each edc assignment statement, since no supervisory calls are made to the operating system. As
reported below, the time overhead for a measurement is two to fifteen microseconds (NOT milliseconds), so
there is much less effect on parallel programming operation.


Overall  View 

As  each  edc  statement  is  encountered  in  the  execution  of  the  test  program  on  the  system  being 
measured,  an  event  is  said  to  occur.  The  events  contain  data  specified  by  the  user  at  the  time  they  are 
inserted  in  programs  used  to  test  the  system,  and  may  include  process  ID,  values  of  variables,  or  other 
state  information.  The  Event  Data  Card  then  adds  hardware  signals  including  a microsecond  timestamp, 
processor  identification,  and  user/supervisor  mode  status,  and  sends  a trigger  signal  to  a logic  analyzer 
to  enter  the  data  into  its  48-bit-wide  buffer.  The  use  of  hardware  probes  to  determine  the  identity  of 
the  processor  initiating  the  measurement  event  is  complicated  by  extensive  use  of  pipelining  in  the 
hardware of the system under test. By the time the Event Data Card becomes aware of the need to
capture  data,  many  of  the  signals  of  interest  have  vanished.  The  solution  is  to  use  flip-flops  on  the 
Event  Data  Card  to  always  capture  these  signals  while  they  are  valid,  then  hold  them  until  they  need  to 
be  entered  into  the  data  latch.  A logical  combination  of  signals  provides  the  needed  processor 
identification.  Many  other  useful  signals  on  this  and  other  systems  could  be  derived  using  this  tech- 
nique, which  is  a simple  form  of  the  preprocessing  to  be  used  in  the  long-term  measurement  system. 
Note  that  in  shared-memory  multiprocessors,  all  processors  can  write  to  at  least  some  common  range  of 
addresses,  one  of  which  can  be  used  as  the  address  for  an  Event  Data  Card.  Thus  only  a single  Event 
Data Card is needed in most shared memory systems.


Data  Buffer  (Logic  Analyzer) 

As  mentioned  above,  a logic  analyzer  is  used  as  a data  buffer  to  store  the  event  data.  One  com- 
plete item  of  data,  including  the  timestamp,  is  captured  every  time  it  is  triggered.  The  logic  analyzer  is 
controlled  through  an  IEEE-488  interface  by  a Sun  computer,  which  performs  statistical  analysis  and 
display  of  the  data,  as  described  below.  The  logic  analyzer  we  are  using  takes  48-bit  samples.  The  48 
input lines are allocated among the signals to be captured as follows: 32 bits for timestamp, 11 bits for
event  data  written  by  the  user  program,  four  bits  for  processor  ID,  and  one  bit  to  indicate 
user/supervisor  mode. 
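
The allocation of the 48 input lines can be illustrated by a small packing and unpacking sketch in C. The field widths (32, 11, 4, and 1 bits) come from the text; the bit positions, the struct and function names, and the convention that a mode bit of 1 means supervisor mode are assumptions made only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (the bit positions are NOT given in the report):
 *   bits  0..31  timestamp in microseconds
 *   bits 32..42  event data written by the user program (11 bits)
 *   bits 43..46  processor ID (4 bits)
 *   bit  47      user/supervisor mode (1 = supervisor, by assumption)
 */
struct edc_sample {
    uint32_t timestamp_us;
    uint16_t event_data;
    uint8_t  processor_id;
    uint8_t  supervisor;
};

/* Pack one sample into the low 48 bits of a 64-bit word. */
uint64_t edc_pack(struct edc_sample s)
{
    assert(s.event_data < (1u << 11));
    assert(s.processor_id < (1u << 4));
    assert(s.supervisor < 2);
    return (uint64_t)s.timestamp_us
         | ((uint64_t)s.event_data   << 32)
         | ((uint64_t)s.processor_id << 43)
         | ((uint64_t)s.supervisor   << 47);
}

/* Recover the four fields from a packed 48-bit sample. */
struct edc_sample edc_unpack(uint64_t raw)
{
    struct edc_sample s;
    s.timestamp_us = (uint32_t)(raw & 0xFFFFFFFFu);
    s.event_data   = (uint16_t)((raw >> 32) & 0x7FFu);
    s.processor_id = (uint8_t)((raw >> 43) & 0xFu);
    s.supervisor   = (uint8_t)((raw >> 47) & 0x1u);
    return s;
}
```

An analysis program on the external computer would apply something like edc_unpack to each word read from the logic analyzer buffer.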

Limitations  caused  by  the  buffer.  The  logic  analyzer  link  in  the  measurement  chain  suffers  from 
significant  limitations  in  capacity  and  speed.  The  512  word  buffer  size  of  the  logic  analyzer  limits  the 
number  of  continuous  samples  that  can  be  captured.  The  effective  transmission  speed  of  the  event  data 
from the logic analyzer to the analysis computer via the IEEE 488 interface is on the order of seconds
for  the  entire  buffer  contents.  The  logic  analyzer  cannot  capture  samples  while  it  is  transmitting  its 
buffer  contents  via  the  IEEE  488  interface.  The  logic  analyzer  cannot  report  the  number  of  samples  it 
has  captured  or  lost;  therefore,  it  is  important  to  plan  each  experiment  carefully  around  this  limitation. 
Finally,  the  number  of  "independent"  devices  in  the  data  chain  creates  at  best  annoyance,  and  at  times 
problems,  in  the  access  and  transmission  of  data. 

Other  buffering  techniques.  Had  this  not  been  an  interim  system,  we  would  have  gone  to  the  trouble 
of  providing  a substantially  larger  buffer  for  the  event  data  on  the  Event  Data  Card  itself.  In  this  case 
the  data  could  either  be  analyzed  later  on  the  system  being  tested,  or  on  an  external  machine  as  in  the 
present  case. 


Program  Perturbation 

The Interim Measurement System has some limitations. The Interim Measurement System still
perturbs the execution of the test program, though a great deal less than the traditional approach. As in
the  traditional  approach,  a programmer  must  add  code  to  the  program  to  delineate  the  events  to  be 
measured.  With  the  Interim  Measurement  System  this  is  just  a simple  assignment  statement  for  each 
timing  event.  Each  time  a new  set  of  events  is  to  be  measured,  a programmer  must  again  change  the 
locations  of  the  added  commands.  Since  activation  of  the  Event  Data  Card  requires  execution  of  code 
placed  in  the  test  programs,  it  is  inevitable  that  there  will  be  some  perturbation  of  the  results  caused  by 
the  measurement  process  itself.  The  long-term  measurement  system  will  avoid  this  problem;  in  the 
meantime,  we  can  produce  useful  results  by  taking  the  perturbations  into  account,  and  keeping  them  as 
small  as  possible.  Experiments  to  measure  this  perturbation  are  reported  below,  and  show  a cost  of 
from  two  to  ten  microseconds  per  write  to  the  Event  Data  Card  with  the  "guinea  pig"  computer  we  are 
testing. 
Summary of System Issues

Although the interim measurement system greatly improves on the traditional measurement approach,
it still perturbs the system to some extent and has some other limitations. The full-fledged
measurement  system,  currently  under  construction,  does  not  suffer  from  these  disadvantages  and  will 
not  perturb  the  system  being  measured. 


MEASUREMENT  METHODOLOGY 


The  coordination  of  a programmer  familiar  with  the  application  program  being  measured  and  an 
analyst  is  required  to  obtain  measurements  using  the  interim  measurement  system. 


Measurement  Events 

The  programmer  and  the  analyst  determine  the  events  to  be  measured  and  the  data  that  the  pro- 
grammer will  output  to  the  Event  Data  Card  to  identify  each  event.  There  are  two  major  categories  of 
metrics  that  are  usually  measured,  intervals  and  frequencies.  Intervals  require  paired  events,  the  start  of 
the  interval  and  the  end.  Frequencies  may  require  one  or  two  events,  depending  on  the  variation  of  the 
metric. In the simplest form, only one event is necessary to keep a simple count. In more complex
variants,  two  events  are  necessary  to  determine  ratios.  The  programmer  must  insert  special  edc  meas- 
urement statements  in  the  source  code  of  the  application  program.  These  write  the  event  data  to  the 
Event  Data  Card  identifying  the  measurement  events.  The  analyst  must  write  (or  modify)  an  analysis 
program  that  will  take  the  timestamped  application-specific  data,  match  up  interval  boundaries  of  each 
event,  and  output  a statistical  summary  for  each  event. 
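
A minimal sketch of such an analysis step, in C, is given here. It assumes a hypothetical encoding in which the low four bits of the event data name the interval and bit 4 marks the end of the interval; the report does not prescribe any encoding, and the names record, interval_stats, and summarize_intervals are likewise illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENT_IDS 16
/* Hypothetical encoding: low 4 bits name the interval, bit 4 marks its end. */
#define EVENT_ID(d) ((d) & 0x0Fu)
#define IS_END(d)   (((d) & 0x10u) != 0)

struct record {
    uint32_t timestamp_us;  /* timestamp added by the Event Data Card */
    uint16_t event_data;    /* data written by the user program */
};

struct interval_stats {
    uint32_t count;
    uint32_t min_us;
    uint32_t max_us;
    uint64_t sum_us;        /* mean = sum_us / count when count > 0 */
};

/* Match each start event with the next end event of the same ID and
 * accumulate count, range, and sum for every interval type. */
void summarize_intervals(const struct record *rec, size_t n,
                         struct interval_stats stats[MAX_EVENT_IDS])
{
    uint32_t start_us[MAX_EVENT_IDS] = {0};
    int open[MAX_EVENT_IDS] = {0};

    for (size_t i = 0; i < MAX_EVENT_IDS; i++) {
        stats[i].count = 0;
        stats[i].min_us = UINT32_MAX;
        stats[i].max_us = 0;
        stats[i].sum_us = 0;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned id = EVENT_ID(rec[i].event_data);
        if (!IS_END(rec[i].event_data)) {
            start_us[id] = rec[i].timestamp_us;
            open[id] = 1;
        } else if (open[id]) {
            /* Unsigned subtraction stays correct across one counter wrap. */
            uint32_t d = rec[i].timestamp_us - start_us[id];
            stats[id].count++;
            stats[id].sum_us += d;
            if (d < stats[id].min_us) stats[id].min_us = d;
            if (d > stats[id].max_us) stats[id].max_us = d;
            open[id] = 0;
        }
    }
}
```

End events with no matching start (for example, at the beginning of a captured window) are simply ignored in this sketch.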


Running  the  Experiment 

The programmer and the analyst then begin execution of the application program on the
multiprocessor computer under test. At times previously coordinated with the programmer, the analyst initiates
a capture  program  on  the  analysis  computer  which  acquires  a copy  of  the  logic  analyzer’s  buffer  con- 
taining the  timestamped  event  data.  This  data  is  placed  in  a file  to  be  analyzed  by  a program  written 
by  the  analyst.  If  the  test  program  has  been  so  designed,  this  process  may  be  repeated  a few  times  for 
each  experiment,  to  determine  the  stability  of  the  statistical  results.  The  logic  analyzer  cannot  sample 
the  Event  Data  Card  information  while  the  analysis  computer  is  reading  out  the  data  buffer.  Therefore, 
either  the  experiment  must  be  designed  to  limit  the  amount  of  event  data  written  to  the  Event  Data 
Card  or  the  data  captured  in  any  arbitrary  window  must  be  sufficient  for  analysis. 


INITIAL  MEASUREMENTS 


A series of experiments was planned to test the Interim Measurement System and to determine
the program and execution overhead involved in its use. In the following discussions, the macro
"edc(<data>)" represents the edc measurement statement, which is actually coded (in C) as
*<address of Event Data Card> = <event data to write>.
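
A hedged sketch of how such a macro might be written in C follows. The report does not give the card's address or data width, so an ordinary variable (fake_edc_register, a hypothetical name) stands in for the card's data latch here, which lets the sketch run on any machine; on the real testbed EDC_ADDRESS would instead be the fixed address at which the Event Data Card responds.

```c
#include <stdint.h>

/* Stand-in for the Event Data Card's data latch. On the testbed this
 * would be a fixed Multibus address, which the report does not give. */
static volatile uint16_t fake_edc_register;
#define EDC_ADDRESS (&fake_edc_register)

/* One store per event. The volatile qualifier keeps the compiler from
 * removing or merging successive writes to the same location. */
#define edc(data) (*EDC_ADDRESS = (uint16_t)(data))
```

With this definition, a statement such as edc(proc | 0xFA) compiles to a single store, with no call into the operating system.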


Determination  of  Measurement  Overhead 

The first program did nothing but write to the Event Data Card to determine the overhead of the
edc() measurement statement. In this program there is only one kind of event, the edc statement.
Therefore  no  encoding  of  the  data  is  necessary  and,  in  fact,  the  data  written  is  irrelevant.  Only  the 
timestamp is of interest to determine the length of each successive interval. Also, the amount of data is
not significant, since any window of data that was captured is as valid as any that was lost. Each experiment
consisted  of  three  variations  to  the  program  structure.  The  first  variation  used  a constant  in  the  edc 
statement. 

edc(0xFA0);
edc(0xFA0);
edc(0xFA0);
etc.

The  second  variation  used  a variable  in  the  edc  statement. 
edc(q); 
edc(q); 
edc(q); 
etc. 

The  third  variation  used  a constant  ORed  with  a variable  in  the  edc  statement. 
edc(proc|0xFA);
edc(proc|0xFA);
edc(proc|0xFA);
etc. 

This  last  variation  was  indicative  of  the  type  of  data  expected  in  other  experiments,  where  the  variable 
may  represent  the  process  number  and  the  constant  may  represent  the  specific  event. 


Measurement  of  Loop  Structures 

Data were taken of the intervals between successive edc assignment statements imbedded in
typical programming looping structures. The second experiment consisted of a single edc assignment
statement imbedded in a DO-WHILE loop construct.
do
{ edc(<one of the above three variations here>);
} while(--q);

The interval between successive executions of the edc instruction measured the speed of the
DO-WHILE construct plus one edc execution. The third experiment consisted of a single edc assignment
statement  imbedded  in  a WHILE  loop  construct. 
while(q--)
{ edc(<one of the above three variations here>);
}

The fourth experiment consisted of a single edc assignment statement imbedded in a FOR loop
construct.

for(i=1; i<=q; i++)
{ edc(<one of the above three variations here>);
}
This resulted in a measure of the speed of the FOR construct.


Data  Analysis 

The  analysis  program  took  the  raw  timestamp  data  and  converted  it  to  intervals.  The  first  pass  of 
the  analysis  program  computed  statistics  for  all  interval  data.  The  second  pass  of  the  analysis  program 
used  the  smallest  interval  as  a baseline  and  discarded  all  intervals  greater  than  four  times  as  large. 
Since  the  intervals  between  events  were  small  and  regular,  this  culling  operation  filtered  out  time  inter- 
vals not  attributable  to  the  application  test  code.  These  longer  intervals  are  from  sources  such  as 
interrupts, wherein  the  processor  serviced  some  other  task  in  the  middle  of  the  execution  of  the  test  loop 
structure. The statistics for this reduced set of intervals were computed along with its distribution. The
statistics  for  each  experiment  include  the  number  of  sample  points  (events),  their  range  and  mean.  If 
any  values  are  more  than  four  times  as  large  as  the  minimum  value,  they  are  excluded.  A new  range 
and mean are computed for the reduced data set. The distribution of the data values for the culled data
is provided for later plotting (Figures 2 - 13). This includes the percentage of data in each
one-microsecond interval.
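
The two-pass procedure described above can be sketched in C as follows. The factor of four and the one-microsecond bins come from the text; the function and type names, and the choice to return counts rather than percentages, are assumptions of this example rather than details of the original analysis program.

```c
#include <stddef.h>
#include <stdint.h>

#define CULL_FACTOR 4   /* discard intervals more than 4x the minimum */

struct cull_result {
    size_t   kept;       /* intervals retained after culling */
    size_t   discarded;  /* intervals removed as outliers */
    uint32_t min_us;     /* smallest interval (the baseline) */
    uint32_t max_us;     /* largest retained interval */
    double   mean_us;    /* mean of the retained intervals */
};

/* ts holds n raw timestamps in microseconds. hist[k] receives the number
 * of retained intervals that are exactly k microseconds long, for
 * k < hist_len. Dividing by result.kept gives the plotted percentages. */
struct cull_result cull_intervals(const uint32_t *ts, size_t n,
                                  uint32_t *hist, size_t hist_len)
{
    struct cull_result r = {0, 0, 0, 0, 0.0};
    uint64_t sum = 0;

    for (size_t k = 0; k < hist_len; k++)
        hist[k] = 0;
    if (n < 2)
        return r;

    /* First pass: find the smallest interval. Unsigned subtraction
     * remains correct across one wrap of the 32-bit timestamp. */
    r.min_us = ts[1] - ts[0];
    for (size_t i = 2; i < n; i++) {
        uint32_t d = ts[i] - ts[i - 1];
        if (d < r.min_us)
            r.min_us = d;
    }

    /* Second pass: cull, then accumulate statistics and the histogram. */
    uint64_t limit = (uint64_t)r.min_us * CULL_FACTOR;
    for (size_t i = 1; i < n; i++) {
        uint32_t d = ts[i] - ts[i - 1];
        if (d > limit) {
            r.discarded++;
            continue;
        }
        r.kept++;
        sum += d;
        if (d > r.max_us)
            r.max_us = d;
        if (d < hist_len)
            hist[d]++;
    }
    r.mean_us = (double)sum / (double)r.kept;
    return r;
}
```

Because the minimum interval itself always survives the cull, at least one interval is retained whenever two or more timestamps are supplied.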


Results 

Each  experiment  shows  a narrow  range  of  time  for  the  event  being  measured,  which  is  consistent 
with  the  expected  results.  One  may  initially  expect  a single  time  value  for  the  event  interval  for  such  a 
repetitious  series,  but  after  some  thought  a narrow  range  seems  more  reasonable.  First,  the  resolution  of 
the  Event  Data  Card  timestamp  is  one  microsecond.  Since  the  events  are  asynchronous  to  the  Event 
Data  Card  clock,  a plus-or-minus  one  microsecond  variation  is  possible  due  to  the  phase  difference  of 
the two clocks. Second, a variation of a few microseconds may be expected due to the instruction prefetch
of the multiprocessor computer under test. Due to the structure of the code, no time variation is expected
based on the operation of the cache, and translation look-aside buffer misses in the memory management
unit are not expected to be frequent in this test. The environment in which these initial measurements
were made was that of an unloaded system: no other active users and only a single (non-parallel)
active process. The measurement results are summarized in Table 1, but are more graphically
presented in the figures mentioned below.

Simple  measurement  statements.  The  first  experiment  consisted  of  nothing  but  in-line  sequential  edc 
statements,  with  no  intervening  code  between  each  statement.  This  resulted  in  a measure  of  the  speed  of 
the  edc  statement.  This  experiment  was  conducted  with  three  variations  in  the  data  written  at  each 
event.  Two  runs  are  plotted  from  each.  In  the  first  case,  a simple  constant  is  written  (See  Figures  2a  and 
2b).  In  this,  as  in  all  other  histograms  in  this  paper,  only  the  culled  data  is  plotted.  Each  plot  shows 
the  percentage  of  the  unculled  data  points  which  were  discarded.  In  the  second  case  a simple  variable 
is  written  (See  Figures  3a  and  3b).  In  the  third  case,  a variable  ORed  with  a constant  is  written  (See 
Figures  4a  and  4b). 

Loop structures. Data were taken of the intervals between successive edc statements imbedded in
typical programming looping structures. The second experiment consisted of a single edc statement
imbedded in a DO-WHILE loop construct. This resulted in a measure of the speed of the DO-WHILE
construct. Two such runs are plotted for all three types of data written by the edc statement (See
Figures 5a and 5b, 6a and 6b, and 7a and 7b). The third experiment consisted of a single edc statement
imbedded in a WHILE loop construct. This resulted in a measure of the speed of the WHILE
construct. Two such runs are plotted for all three types of data written by the edc statement (See Figures
11a and 11b, 12a and 12b, and 13a and 13b). The fourth experiment consisted of a single edc
statement imbedded in a FOR loop construct. This resulted in a measure of the speed of the FOR
construct. Two such runs are plotted for all three types of data written by the edc statement (See Figures
8a and 8b, 9a and 9b, and 10a and 10b). Subtracting the basic measurement overhead (Figures 2
through 4) from the corresponding times in Figures 5 through 13 allows evaluation of the execution
time of the loop mechanisms in the presence of different types of imbedded instructions. The
appropriate columns in Table 1 show the substantial consistency.


CONCLUSIONS 


The results shown here demonstrate that a relatively simple hardware attachment makes it possible
to measure execution times of programs, and small segments of programs, with few-microsecond
accuracy and without substantially perturbing the execution of the program.
TABLE 1 - Execution time measurements